scgpt

Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

Generative pre-trained models have achieved remarkable success in various domains such as natural language processing and computer vision. Specifically, the combination of large-scale diverse datasets and pre-trained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between linguistic constructs and cellular biology - where texts comprise words, similarly, cells are defined by genes - our study probes the applicability of foundation models to advance cellular biology and genetics research. Utilizing the burgeoning single-cell sequencing data, we have pioneered the construction of a foundation model for single-cell biology, scGPT, which is based on generative pre-trained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT, a generative pre-trained transformer, effectively distills critical biological insights concerning genes and cells. Through the further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference.

Haotian Cui , Chloe Wang , Hassaan Maan , Kuan Pang , Fengning Luo , Bo Wang ,

Monthly downloads

1000

scGPT ¶

This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.

Preprint

Documentation

!UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT checkpoints section for more details.

Installation ¶

scGPT is available on PyPI. To install scGPT, run the following command:

$ pip install scgpt

[Optional] We recommend using wandb for logging and visualization.

$ pip install wandb

For developing, we are using the Poetry package manager. To install Poetry, follow the instructions here.

$ git clone this-repo-url
$ cd scGPT
$ poetry install

Note: The flash-attn dependency usually requires specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. For now, May 2023, we recommend using CUDA 11.7 and flash-attn<1.0.5 due to various issues reported about installing new versions of flash-attn.

Pretrained scGPT Model Zoo ¶

Here is the list of pretrained models. Please find the links for downloading the checkpoint folders. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares similar cell type context with the training data of the organ-specific models, these models can usually demonstrate competitive performance as well.

Model name	Description	Download
whole-human (recommended)	Pretrained on 33 million normal human cells.	link
brain	Pretrained on 13.2 million brain cells.	link
blood	Pretrained on 10.3 million blood and bone marrow cells.	link
heart	Pretrained on 1.8 million heart cells	link
lung	Pretrained on 2.1 million lung cells	link
kidney	Pretrained on 814 thousand kidney cells	link
pan-cancer	Pretrained on 5.7 million cells of various cancer types	link

Fine-tune scGPT for scRNA-seq integration ¶

Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder stored in the examples/save directory.

To-do-list ¶

Upload the pretrained model checkpoint
Publish to pypi
Provide the pretraining code with generative attention masking
Finetuning examples for multi-omics integration, cell type annotation, perturbation prediction, cell generation
Example code for Gene Regulatory Network analysis
Documentation website with readthedocs
Bump up to pytorch 2.0
New pretraining on larger datasets
Reference mapping example
Publish to huggingface model hub

Contributing ¶

We greatly welcome contributions to scGPT. Please submit a pull request if you have any ideas or bug fixes. We also welcome any issues you encounter while using scGPT.

Acknowledgements ¶

We sincerely thank the authors of following open-source projects:

Citing scGPT ¶

@article{cui2023scGPT,
title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}

Related notebook

Single-cell RNA sequencing methods can profile the transcriptomes of single cells but cannot preserve spatial information. Conversely, spatial transcriptomics assays can profile spatial regions in tissue sections but do not have single-cell resolution. Here, Runmin Wei (Siyuan He, Shanshan Bai, Emi Sei, Min Hu, Alastair Thompson, Ken Chen, Savitri Krishnamurthy & Nicholas E. Navin) developed a computational method called CellTrek that combines these two datasets to achieve single-cell spatial mapping through coembedding and metric learning approaches. They benchmarked CellTrek using simulation and in situ hybridization datasets, which demonstrated its accuracy and robustness. They then applied CellTrek to existing mouse brain and kidney datasets and showed that CellTrek can detect topological patterns of different cell types and cell states. They performed single-cell RNA sequencing and spatial transcriptomics experiments on two ductal carcinoma in situ tissues and applied CellTrek to identify tumor subclones that were restricted to different ducts, and specific T-cell states adjacent to the tumor areas.

T-cells

More

Required GPU

InferCNV is used to explore tumor single cell RNA-Seq data to identify evidence for somatic large-scale chromosomal copy number alterations, such as gains or deletions of entire chromosomes or large segments of chromosomes. This is done by exploring expression intensity of genes across positions of tumor genome in comparison to a set of reference 'normal' cells. A heatmap is generated illustrating the relative expression intensities across each chromosome, and it often becomes readily apparent as to which regions of the tumor genome are over-abundant or less-abundant as compared to that of normal cells. **Infercnvpy** is a scalable python library to infer copy number variation (CNV) events from single cell transcriptomics data. It is heavliy inspired by InferCNV, but plays nicely with scanpy and is much more scalable.

Basal respiratory cells

Ionocytes

More

Required GPU

Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology. Here, we will demonstrate a basic workflow to work with ***Geneformer*** models. These notebooks include the instruction to: 1. Prepare input datasets 2. Finetune Geneformer model to perform specific task 3. Using finetuning models for cell classification and gene classification application

Ionocytes

Respiratory ciliated cells

More

Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

Monthly downloads

scGPT ¶

Installation ¶

Pretrained scGPT Model Zoo ¶

Fine-tune scGPT for scRNA-seq integration ¶

To-do-list ¶

Contributing ¶

Acknowledgements ¶

Citing scGPT ¶

Related notebook

BioTuring

BioTuring

BioTuring